# Multimodal Vision-Language

Qwen2.5 VL 7B Instruct Gemlite Ao A8w8
Apache-2.0
An A8W8-quantized (8-bit activations, 8-bit weights) multimodal large language model based on Qwen2.5-VL-7B-Instruct, supporting vision and language tasks.
Image-to-Text Transformers
mobiuslabsgmbh
161
1
Llava 1.5 13b Hf I1 GGUF
This project provides weighted/imatrix quantized versions of the llava-1.5-13b-hf model, in a range of quantization types to suit different usage scenarios.
Text-to-Image Transformers English
mradermacher
332
1
Spaceqwen2.5 VL 3B Instruct I1 GGUF
Apache-2.0
SpaceQwen2.5-VL-3B-Instruct is a 3B-parameter vision-language model focused on spatial reasoning and multimodal tasks.
Text-to-Image English
mradermacher
459
0
VLM R1 Qwen2.5VL 3B OVD 0321
Apache-2.0
A zero-shot object detection model based on Qwen2.5-VL-3B-Instruct, enhanced with VLM-R1 reinforcement learning, supporting open vocabulary detection tasks.
Text-to-Image Safetensors English
omlab
892
11
Eagle2 1B
Eagle 2 is a high-performance vision-language model family that focuses on transparency in data strategies and training schemes, aiming to drive the open-source community in developing competitive vision-language models.
Image-to-Text Transformers Other
nvidia
1,791
23
Eagle2 2B
Eagle2 is a high-performance vision-language model family introduced by NVIDIA, focusing on enhancing the performance of open-source vision-language models through data strategies and training approaches. Eagle2-2B is the lightweight model in this series, achieving outstanding efficiency and speed while maintaining robust performance.
Text-to-Image Transformers Other
nvidia
667
21
Minivla Libero90 Prismatic
MIT
MiniVLA is a 1-billion-parameter vision-language model compatible with the Prismatic Vision-Language Model codebase, suitable for robotics and multimodal tasks.
Image-to-Text Transformers English
Stanford-ILIAD
127
0
Paligemma2 28b Mix 224
PaliGemma 2 is an upgraded vision-language model launched by Google, combining the Gemma 2 language model with the SigLIP vision model and supporting multilingual image-text interaction tasks.
Image-to-Text Transformers
google
2,050
4
Paligemma2 28b Mix 448
PaliGemma 2 is a vision-language model based on Gemma 2, supporting image+text input and text output, suitable for various vision-language tasks.
Image-to-Text Transformers
google
198
26
Paligemma2 10b Pt 896
PaliGemma 2 is a vision-language model (VLM) launched by Google, integrating Gemma 2 capabilities and supporting image and text input to generate text output.
Image-to-Text Transformers
google
233
32
Paligemma2 10b Pt 448
PaliGemma 2 is Google's upgraded vision-language model (VLM) that combines Gemma 2 capabilities, supporting image and text input to generate text output.
Image-to-Text Transformers
google
282
14
Paligemma2 3b Pt 448
PaliGemma 2 is a vision-language model based on Gemma 2, supporting image and text input to generate text output, suitable for various vision-language tasks.
Image-to-Text Transformers
google
3,412
45
Paligemma2 3b Pt 224
PaliGemma 2 is a vision-language model (VLM) developed by Google, combining the capabilities of the Gemma 2 language model and SigLIP vision model, supporting multilingual vision-language tasks.
Image-to-Text Transformers
google
30.51k
148
Paligemma2 10b Mix 224
PaliGemma 2 is a vision-language model based on Gemma 2, supporting image and text input to generate text output, suitable for various vision-language tasks.
Image-to-Text Transformers
google
701
7
Paligemma2 3b Mix 448
PaliGemma 2 is a vision-language model based on Gemma 2, supporting image and text inputs with text generation output, suitable for various vision-language tasks.
Image-to-Text Transformers
google
20.55k
44
Paligemma2 3b Ft Docci 448
PaliGemma 2 is an upgraded vision-language model released by Google, combining the Gemma 2 language model with the SigLIP vision model and supporting multilingual vision-language tasks.
Image-to-Text Transformers
google
8,765
12
Llama 3.1 8B Dragonfly V2
Dragonfly is an instruction-tuned multimodal vision-language model based on Llama 3.1, supporting joint understanding and generation of images and text.
Image-to-Text English
togethercomputer
113
1
Openvla V01 7b
MIT
OpenVLA v0.1 7B is an open-source vision-language-action model trained on the Open X-Embodiment dataset, supporting control of a variety of robots.
Text-to-Image Transformers English
openvla
30
10
Paligemma 3b Pt 448
PaliGemma is a lightweight and versatile vision-language model built on the SigLIP vision model and Gemma language model, supporting multilingual image-text interaction tasks.
Image-to-Text Transformers
google
2,708
29
Paligemma 3b Pt 896
PaliGemma is a versatile lightweight vision-language model (VLM) that supports image and text inputs and generates text outputs. It has multilingual capabilities.
Image-to-Text Transformers
google
1,788
119
Paligemma 3b Ft Refcoco Seg 896
PaliGemma is a lightweight vision-language model developed by Google, built upon the SigLIP vision model and Gemma language model, supporting multilingual text generation and visual understanding tasks.
Image-to-Text Transformers
google
20
6
Paligemma 3b Mix 224
PaliGemma is a versatile, lightweight vision-language model (VLM) built upon the SigLIP vision model and Gemma language model, supporting image and text inputs with text outputs.
Text-to-Image Transformers
google
143.03k
75
Paligemma 3b Pt 224
PaliGemma is a versatile, lightweight vision-language model (VLM) built upon the SigLIP vision model and the Gemma language model, capable of processing both image and text inputs to generate text outputs.
Image-to-Text Transformers
google
38.40k
318
Vitamin XL 384px
MIT
ViTamin-XL-384px is a large-scale vision-language model based on the ViTamin architecture, specifically designed for vision-language tasks, supporting high-resolution image processing and multimodal feature extraction.
Image-to-Text Transformers
jienengchen
104
20
Internvl 14B 224px
MIT
InternVL-14B-224px is a 14B-parameter vision-language foundation model supporting various vision-language tasks.
Text-to-Image Transformers
OpenGVLab
521
37
© 2025 AIbase